Comprehensive genomic and epigenomic analysis in cancer of unknown primary guides molecularly-informed therapies despite heterogeneity

The benefit of molecularly-informed therapies in cancer of unknown primary (CUP) is unclear. Here, we use comprehensive molecular characterization by whole genome/exome, transcriptome and methylome analysis in 70 CUP patients to reveal substantial mutational heterogeneity with TP53, MUC16, KRAS, LRP1B and CSMD3 being the most frequently mutated known cancer-related genes. The most common fusion partner is FGFR2, the most common focal homozygous deletion affects CDKN2A. 56/70 (80%) patients receive genomics-based treatment recommendations which are applied in 20/56 (36%) cases. Transcriptome and methylome data provide evidence for the underlying entity in 62/70 (89%) cases. Germline analysis reveals five (likely) pathogenic mutations in five patients. Recommended off-label therapies translate into a mean PFS ratio of 3.6 with a median PFS1 of 2.9 months (17 patients) and a median PFS2 of 7.8 months (20 patients). Our data emphasize the clinical value of molecular analysis and underline the need for innovative, mechanism-based clinical trials.

C ancer of unknown primary site (CUP) is defined as metastatic cancer without detection of a tumor of origin and accounts for 2-4% of all malignancies. Treatment options are limited and in the majority of cases insufficient 1,2 . Routine diagnostic measures contain blood analyses, histopathology including immunohistochemistry as well as imaging of thorax, abdomen, and pelvis. Even though the prognosis is unfavorable in the majority of cases,~15% of patients can be categorized into more favorable subsets that benefit from treatment similar to therapies administered to patients with corresponding primary tumors and metastatic spread. Therefore, identification of those favorable CUP subtypes is important for efficient treatment 3 . Currently, CUP subtype identification is mainly based on disease localization and a systematic, in-depth immunohistochemical assessment, but considerable research efforts aim at improving its accuracy or applicability to more patients by new technologies. RNA expression profiling has been reported to identify the tissue of origin in 80-90% of cases [4][5][6][7] . Epigenetic profiling using DNA methylation signatures has been reported to predict the tissue of origin in almost 90% of cases 8 . Nevertheless, to date there is no clear evidence from prospective randomized trials that site-specific therapy based on these new approaches leads to improved patient outcome 9,10 .
Broad, mainly panel-based, molecular analyses of patients with CUP revealed profound genetic heterogeneity with most frequent alterations in common cancer-related genes like TP53, KRAS, CDKN2A, and SMAD4 [11][12][13] . In many cases genetic alterations were identified 14 that potentially can be addressed therapeutically 15 . Still, the clinical benefit of molecularly guided treatment in CUP remains unclear.
Here we describe a cohort of 70 CUP patients characterized by comprehensive molecular profiling within the MASTER program of the National Center for Tumor Diseases and the German Cancer Consortium (NCT/DKTK) combining whole-exome/ genome sequencing, transcriptome and methylome analysis in a clinical workflow to identify therapeutic targets. Based on deep insights into the highly individual molecular landscape of the disease, molecular tumor board recommendations led to genomics-guided treatment in 20 patients.

Results
Patients. Seventy CUP patients were included of whom 61 met the criteria defined by the ESMO clinical practice guideline 3 . In the remaining nine cases documentation of necessary initial imaging procedures was lacking (such as CT scans of thorax, abdomen and pelvis). Median age was 46 years (range 18 to 73). 27/70 (39%) patients were male, 43/70 (61%) were female. Median follow-up time was 25.9 months. Median overall survival (OS) was 22.1 months. 38 patients died during the observation period. Documentation of previous therapies and tumor burden was available for 69 patients. The median number of systemic therapies prior to sample submission for these 69 patients was 1 (range 1-7). Detailed patient characteristics are depicted in Table 1 and Supplementary Data 1 and 2.
Additionally, we were able to identify several oncogenic gene fusions of interest (Supplementary Data 3). In six patients, we detected a gene fusion involving FGFR2, which always occurred at the same splice-site of the FGFR2 gene (Fig. 3a). These fusions are typical for a subgroup of cholangiocarcinoma 24 . Usually, the intact kinase domain of FGFR2 is fused with a gene, which provides a dimerization/oligomerization domain that facilitates constitutive activation of downstream RAS/MAPK and PI3K/ AKT pathways. Moreover, they can be addressed therapeutically using FGFR inhibitors 25 . EML4-ALK fusions were detected in the tumors of three patients (3/70; CUP-02, CUP-20, CUP-11) and could be confirmed by immunohistochemistry in all cases. At the same time CUP-33 was reported to have a previously detected EML4-ALK fusion that could not be found in our analysis, most likely due to low quality of the biopsy sample. EML4-ALK fusions are characteristic for a subtype of non-small-cell lung cancer (NSCLC) and can be addressed therapeutically using ALK inhibitors 26 . We detected two EWSR1-WT1 fusions (CUP-15, CUP-42), which are typical for desmoplastic small round cell tumors 27 . Two patients diagnosed with melanoma of unknown primary by histologic analysis (CUP-34 and CUP-64) harbored two BRAF fusions each, of which CUP-34 also had a BRAF V600E mutation.
CUP-14 had an ARHGAP26-CLDN18 fusion, which has been described in gastric signet-ring cell carcinoma 28 . A previously undescribed MXI1-NUTM1 fusion was detected and confirmed by routine diagnostics in CUP-04 (Fig. 3b). NUTM1 fusions define NUT midline carcinomas 29 and usually involve BRD3 or BRD4 as fusion partners. In CUP-08 we detected an EZR-ERBB4 fusion, which has been described only once in KRAS wild type invasive mucinous lung adenocarcinoma 30 . Furthermore, we identified an ITSN1-BRAF fusion in CUP-70 that has not been described yet. However, without additional information the fusion event can't be linked to a certain entity since BRAF fusions play a role in various tumor entities 31 .
In total, 20/70 (29%) patients harbored rare genetic alterations that could be linked to specific entities. Not all of them correlated with entity prediction results based on transcriptome profiling and these alterations alone did not necessarily justify diagnostic reclassification. Still, together with omics-based entity predictions they offer meaningful information of diagnostic value and can be useful for further treatment decisions.
Genomics-based treatment recommendations. A dedicated molecular tumor board (MTB) recommended personalized therapy options based on information from DNA and RNA sequencing for 56/70 (80%) patients in our cohort. Median turnaround time from sample submission to MTB was 2.3 months (range 0.7-7.0 months). In four cases without recommendation (4/70) the tumor cell content of the respective biopsy was not sufficient for analysis, in six cases (6/70) the respective patients died before the tumor board could convene, in two cases (2/70) the sample quality was not sufficient for analysis and in the remaining two cases (2/70) no targetable mutations were found. Two patients received a second MTB recommendation (MTBR) based on a follow-up biopsy after progression of disease occurred while being treated with molecularly guided therapy (pazopanib, CUP-42; cetuximab + carboplatin + paclitaxel, CUP-70). In total, 58 MTBRs were issued. The first 56 MTBRs contained 142 drug recommendations, which were grouped into eight different baskets (tyrosine kinases, PI3K-AKT-mTOR, RAF-MEK-ERK, developmental pathways, DNA damage response, cell cycle regulation, immune evasion and others) by the type of pathway a recommended drug interacts with 32 . Tyrosine kinases were the most common basket used for recommendations (47/164, Fig. 4a and Supplementary Data 10). All drug recommendations were sorted into groups by the NCT/ DKTK evidence level they were based on as described by Leichsenring and colleages 33 (Level 1A/B/C, 11/142, 8%; Level 2A/B/C, 89/142, 63%; Level 3, 31/142, 22%; Level 4: 11/142, 8%). 18 samples had a mutational burden of at least 100 non-silent SNVs and coding indels, which was defined as hypermutation by the MTB and used as a potential rationale for immune checkpoint inhibitor recommendations (immune evasion basket).
Genomics-based systemic treatment. Twenty (20/56, 36%) patients were treated in accordance with MTBRs using 30 applied drugs or drug combinations (Fig. 5 and Supplementary Data 11). Treatments were given off-label at the discretion of the treating oncologist. Recommendations based on the immune evasion basket were most likely to be applied (Fig. 4b). The distribution of NCT/DKTK evidence levels of the first clinically applied recommendations showed a similar distribution as the Fig. 2 Methylome and transcriptome-based clustering. a tSNE plot based on the 5000 most variant CpG sites across the TCGA pan-cancer cohort (n = 8024, 33 different cancer entities in 32 entity baskets). The tSNE analysis was jointly performed on the complete TCGA and MASTER CUP samples (n = 55) to ensure comparability within the landscape. This subplot shows TCGA samples only. While many TCGA entities show distinctive clusters, some do overlap with other entities. b This subplot illustrates all MASTER CUP patients (black) on top of the TCGA sample landscape (gray). CUP samples were close to many entities that did not necessarily cluster distinctively (e.g., gastrointestinal tumors). c Transcriptome-based tSNE clustering of 33 different cancer entities in 32 baskets using TCGA data without CUP patients (n = 1809). d Transcriptome-based tSNE clustering of MASTER CUP patients (black, n = 55) among the background of TCGA-based clusters (gray). As with the methylome-based clustering, a notable fraction of the samples are found in the diffusely structured center of the tSNE clustering. e Venn diagram depicting concurring results between methylome-based CUP entity predictions (comparison to TCGA) and transcriptome-based entity predictions (comparison both to TCGA and MASTER; each depicted in separate groups). 48 patients of the CUP cohort had predictions based on all three methods and were therefore eligible for comparison (Supplementary Data 8). f Venn diagram depicting concurring results between both transcriptome-based CUP entity predictions (TCGA and MASTER comparison). 55 patients of the CUP cohort had transcriptome-based predictions (Supplementary Data 8). Source data are provided as a Source Data file.
one of all drug recommendations (Level 1, 2/20, 10%; Level 2, 15/20, 75%; Level 3, 1/20, 5%; Level 4, 2/20, 10%). To evaluate clinical benefit, we calculated the ratio (PFSr) of the progression-free survival time associated with the first applied therapy recommended by MASTER (PFS2) and the progressionfree survival associated with the last prior systemic therapy (PFS1). Three (3/20, 15%) patients did not progress during their last systemic therapy before they received a recommended therapy. Therefore, neither PFS1 nor PFSr could be determined for them, but recommended therapies translated into a PFS2 of 6.0, 10.0 and 11.1 months, respectively. Mean PFSr for the other 17 patients was 3.6. Median PFSr was 2.3 with a range from 0.2 to 16.4. Median PFS1 was 2.9 months (n = 17) and median PFS2 was 7.8 months (n = 20, Supplementary Fig. 4). In one patient the PFS defining event was death, in the others it was progressive disease. To improve concordance with physicianperceived clinical benefit, we calculated the modified PFS2/ PFS1 ratio (mPFSr) as described by Mock, Heilig and colleagues 34 , which resulted in a mean mPFSr of 5.0 (median mPFSr = 2.7; range 0.2 to 12.0; Table 2).
13/17 treated patients had a mean PFSr > 1.3, which was considered as clinical benefit since this value has been frequently used as a measure for positive clinical outcome in precision oncology trials [35][36][37] . Four of them received treatment with immune checkpoint inhibitors, three with ALK inhibitors, three with multikinase inhibitors, one with trastuzumab, one with vismodegib and one with olaparib plus gemcitabine (Supplementary Data 12). Three patients had a PFS2 > 1 year, namely CUP-57 (26.5 months, trastuzumab), CUP-18 (23.3 months, nivolumab) and CUP-08 (12.2 months, olaparib plus gemcitabine).
For patients that did not receive a recommended treatment, we calculated the ratio of the first treatment applied after the MTB (PFSb; median = 3.8 months, n = 12) and the last prior systemic treatment (PFSa; median = 4.8 months, n = 11), which resulted in a mean PFSr of 0.67 (median PFSr = 0.71, range 0.1 to 1.0, n = 11; Supplementary Data 13). Median overall survival of the 36 patients without application of recommended treatments was significantly shorter than of the 20 patients that received a recommended therapy (18.3 months vs. 34.8 months, p = 0.022). Same was true for median PFS2 and PFSb (7.8 months vs. 3.8 months, p < 0.0001). Two patients with PFSb had stable disease ≥6 months or achieved objective remissions (PR/CR; Supplementary Fig. 4). Since our study was not randomized, these results are not controlled for possible confounding factors. Within MASTER, reasons for non-implementation of MTBRs included lack of availability or reimbursement of recommended treatments, deterioration of a patient's general condition and death before treatment application 38 . Further data are provided in Supplementary Data 13 and 14.
The median number of prior systemic palliative therapies that patients with an applied MTBR had received was three (range 1−7). Eleven had already been treated with targeted therapies indicating that clinical benefit may be achieved even in heavily pretreated patients.

Discussion
In addition to known recurrent mutations in CUP, our comprehensive WGS/WES approach detected a variety of rare genetic alterations, which were relevant for molecularly targeted treatment decisions. This suggests that comprehensive molecular analysis is particularly well-suited for this heterogeneous disease.
On the genomic level, we observed frequent mutations in wellknown cancer-related genes such as TP53, MUC16 and KRAS. The majority of these common alterations has previously been described in studies using gene panel sequencing. When using a 50 gene panel, Löffler and colleagues described TP53, KRAS, CDKN2A, and SMAD4 as the most frequently mutated genes and CDKN2A as the most frequently deleted gene in CUP 11 . Varghese and colleagues reported a variety of targetable alterations in 45/ 150 CUP patients using MSK-IMPACT, a deep-coverage hybridization capture-based assay encompassing 341 (later expanded to 410) cancer-associated genes accompanied by WES in 13 cases 14 . The most commonly mutated genes were TP53, KRAS, CDKN2A, KEAP1, and SMARCA4 and 15/150 patients received targeted therapies 14 . These common mutations are involved in a variety of cell processes and do not offer a clear rationale for targeted therapies available at the moment. Furthermore, we and others report a substantial amount of pathogenic germline mutations amongst CUP patients [39][40][41][42] . These can have a direct impact on the patient and their families. Therefore, germline testing should be considered, especially for young patients with CUP or patients with previous malignancies.
On the transcriptomic level, we detected a variety of therapeutically relevant gene fusions. In addition, we classified tumors based on transcriptomic and epigenetic information, which was complemented by specific disease-defining alterations and by presence of certain dominant mutational signatures. Nevertheless, only a third of the entity predictions based on methylome and transcriptome did match, which may be explained by tumor cell content, RNA quality, differences between TCGA and the MASTER cohort composition, as well as differences in the methods used. The question whether site-specific therapies are beneficial for CUP patients is still a matter of ongoing debate. Using a 92-gene RT-PCR cancer classification assay, Hainsworth and colleagues reported that site-specific therapy leads to significantly improved survival when clinically more responsive tumor types were predicted 9 . In contrast, Hayashi and colleagues reported that site-specific treatment based on microarray profiling did not result in a significant improvement in 1-year survival compared with empirical paclitaxel and carboplatin, although prediction of the original site seemed to be of prognostic value 43 . Similarly, a meta-analysis by Rassy et al. showed no significant survival benefit with site-specific in comparison to empiric chemotherapy. At the same time the heterogeneity across the available data demonstrates that further well-designed trials are needed 10 and that a reliable classification method for the attribution of a CUP case to a specific tumor entity is still to be developed. Moran and colleagues used microarray DNA methylation signatures (EPICUP) to predict a primary cancer of origin in 188 (87%) of 216 CUP patients. In this study, patients with EPICUP diagnoses who received a tumor type-specific therapy showed improved overall survival compared with that in patients who received empiric therapy 8 . Prospective validation of this epigenetic approach is still missing.
In our study, we used epigenetic and transcriptomic analyses together with evaluation of disease-specific mutations to identify the tissue of origin. On the one hand this increased the total number of patients for whom an entity prediction was possible, on the other hand it led to contradictory results in a majority of patients that had several layers of information available. In these cases, it is not clear which data is to be trusted more. Probably, several factors contribute to prediction errors by one or the other method: First, the composition of the MASTER CUP cohort and the TCGA reference cohort differ in several aspects, including metastatic status and represented entity subtypes. Second, differences in the sample preparation protocols between MASTER and TCGA may introduce technical biases, confounding the algorithms of the classifiers. And third, some entities are hard to distinguish using epigenetic/transcriptomic information. For example, the classifiers frequently produced discrepant results concerning hepatic metastases of pancreatic (PAAD) or cholangiocellular (CHOL) carcinoma, which were classified inconsistently as one of PAAD, CHOL or LIHC (liver hepatocellular carcinoma). It is unclear whether the missing success of prospective trials using site-specific treatment is due to limited accuracy in identifying the tissue of origin or due to limited relevance of the site of origin for clinical outcome in the majority of CUP patients. In our cohort, MTB recommendations were not influenced by our methylome-and transcriptome-based entity prediction since it was not available at the time of the MTB. Future trials might benefit from a well-designed integrated classifier taking into account both methylation and transcriptomic data. As previously shown in pancreatic cancer and other hardto-treat entities, comprehensive molecular profiling also offers the opportunity to detect rare or previously unknown therapeutic targets 44,45 . Therefore, in the light of continuously improving options regarding molecular diagnostics and targeted therapies, genomics-based treatment might be the more promising approach.
Our study had several potential limitations. First, our patient population was younger than one would expect for a representative CUP cohort, which can at least be partially explained by the NCT/DKTK MASTER inclusion criteria. Second, our cohort was treated with a wide range of different therapies prior to molecular analysis. Third, our study was not a randomized clinical trial but a prospective observational study. However, the mean PFS2/PFS1 ratio in our cohort was 3.6. 13/17 treated patients (77%) for which a PFS ratio could be determined achieved a ratio higher than 1.3, the originally proposed threshold for assessment of clinical benefit in previous studies like MOS-CATO 01 35 . The median overall survival in our cohort was significantly longer when compared to published data, which may be partially attributed to the young patient age in our cohort. Nevertheless, our results provide evidence that a considerable part of CUP patients may benefit from comprehensive molecular analysis. Although there are case reports about successful use of checkpoint inhibitors in CUP patients 46 , immunotherapy in CUP has not been clinically implemented yet, unless microsatellite instability or DNA mismatch repair (MMR) deficiency have been detected. Our results underline that immunotherapeutic approaches can be efficient in a much larger proportion of CUP patients. Furthermore, we observed a meaningful proportion of CUP patients benefiting from molecularly stratified treatments. Two prospective randomized phase II trials testing novel strategies versus empirical chemotherapy are currently ongoing (CUPISCO, NCT03498521 and CheCUP, NCT04131621).
In conclusion, our findings indicate that comprehensive molecular analysis of CUP patients can be highly beneficial even at late stages or following several rounds of prior treatment.
We provide valuable insight into the heterogenic genomic, transcriptomic and epigenetic landscape of CUP and show potentially actionable alterations in a large proportion of patients. Further prospective clinical studies to assess the impact of genomics-based personalized cancer therapy are warranted.

Methods
Clinical and statistical analysis. The study included 70 patients who were enrolled in the National Center for Tumor Diseases and German Cancer Consortium (NCT/DKTK) Molecularly Aided Stratification for Tumor Eradication Research (MASTER) precision oncology program between May 2013 and July 2018 with follow-up until June 2019 32,47 . All 70 patients had a CUP diagnosis according to their referring oncologists. Clinical data for cohort selection, description and analysis was obtained from the National Center for Tumor Diseases (NCT) Heidelberg and Dresden, as well as from six other comprehensive cancer centers (CCCs) of the German Cancer Consortium (DKTK). The DKTK network includes ten CCCs at eight sites (Berlin, Dresden, Essen/Düsseldorf, Frankfurt/Mainz, Freiburg, Heidelberg, Munich, Tübingen). Demographic data, histopathological diagnosis, location of metastases at the time of enrollment, fulfillment of the ESMO CUP diagnostic criteria, systemic therapies and staging information, genomic information available at the time of the molecular tumor board (MTB), recommendations of the MTB and application of recommended therapies were assessed and documented in a centrally managed electronic data capture system (ONKOSTAR). MTB recommendations were based on the information obtained from DNA and RNA sequencing. Therapeutic options steadily improved over time.
Every tumor board recommendation contains several drugs or drug combinations with different priorities assigned. For our analysis, we included only the first three priorities since there was no drug with a lower priority clinically applied. In some cases, there were fewer than three drugs recommended. All drug recommendations were issued with an NCT/DKTK evidence level reflecting the origin of the information that the respective recommendation was based on 33 . Overall survival (OS) was defined as the time from the date of diagnosis to the date of death or last follow-up. Progression-free survival (PFS) was defined as the time from the date of systemic therapy initiation to the date of death, progressive disease or last followup. Median OS, PFS and follow-up time were estimated using the Kaplan-Meier method, and a log-rank test was used to compare OS and PFS among patient subgroups. PFS of the first applied treatment recommended by the MTB (PFS2) was compared to the PFS of the last prior systemic treatment (PFS1) in each individual patient. If more than one recommended therapy was applied, PFS3 and following were calculated. PFS defining events were progressive disease or death, determined by a medical oncologist via review and assessment of the corresponding medical documents. The progression-free survival time ratios (PFSr) between PFS2 and PFS1 were calculated. Modified progression-free survival time ratios (mPFSr) were calculated following the proposal of Mock and colleagues 34 . For patients that did not receive a recommended treatment, we calculated the ratio of the first treatment applied after the MTB (PFSb) and the last prior systemic treatment (PFSa).
NCT/DKTK MASTER. NCT/DKTK MASTER is a prospective, continuously recruiting, multicenter observational study that provides a standardized diagnostic workflow, which enables molecularly informed decisions for further therapy. Treatment recommendations are made in cooperation with treating oncologists following interdisciplinary discussion in a molecular tumor board. MASTER includes adults with advanced cancer across all entities who are younger than 51 years and patients with rare tumors, including rare subtypes of more common entities, regardless of age. Patients must have exhausted curative treatment options and be in good general condition (Eastern Cooperative Oncology Group performance status of 0 or 1) 38 . Patients with cancers of unknown primary were included regardless of age due to its rarity. Patients provided written informed consent for banking of tumor and control tissue, molecular analysis including germline analysis, and the collection of clinical data under a protocol (S-206/2011) approved by the Ethics Committee of the Medical Faculty of Heidelberg University. The study was conducted in accordance with the Declaration of Helsinki. Patients did not receive participant compensation. Molecularly informed therapies were not part of MASTER but given off-label at the discretion of and by the treating oncologist who obtained informed consent for each therapy. German regulations for off-label treatment allow individual treatment decisions after obtaining informed consent and no IRB approval is required. Costs for off-label drugs can be reimbursed by German health insurances if the patient has a severe disease, if there is no other treatment option available and if there is reasonable hope for treatment success based on available scientific or clinical data.
Entity prediction validation cohort. We used 100 consecutive MASTER patients enrolled between 12/2020 and 06/2021 (Supplementary Data 5) consisting only of entities that are part of TCGA as a cohort to validate all entity prediction methods and measured their accuracy before using them for CUP entity predictions. Transcriptome data was available for 72 patients of the validation cohort (Supplementary Data 6) methylome data for 77 (Supplementary Data 7).
Next-generation sequencing and bioinformatic processing Sample preparation and sequencing. DNA from fresh frozen tumor tissue was isolated using the Allprep DNA/RNA/miRNA Universal Kit (Qiagen) or QIAamp DNA mini (QIAGEN). DNA from formalin fixed paraffin embedded tissue was isolated using the GeneRead DNA FFPE Kit (QIAGEN). DNA from peripheral blood was isolated using QIAamp DNA Blood Mini (Qiagen) or QIASymphony DSP DNA Mini Kit (Qiagen). The isolation process was followed by quality control and quantification using a Qubit 2.0 Fluorometer (Invitrogen) and a TapeStation 2200 system (Agilent). Libraries for whole-genome sequencing were prepared with the Illumina TruSeq Nano (100 ng genomic DNA as input). Both tumor and control (germline) samples were sequenced on 2 lanes Illumina HiSeq X Ten Nucleotide sequence alignment. DNA sequencing reads were mapped to the assembly comprising human genome (1000 Genomes Phase 2 of the Genome Reference Consortium; version hs37d5) and a genome of Enterobacteria phage phiX174 using BWA mem (version 0.7.15) with -T0 parameter as the one different from the default. BAM files were sorted with bamsort (biobambam package, version 0.0.148), and duplicates were marked with markdup (Sambamba package, version 0.6.5) 48 . Sequencing quality statistics are summarized in Supplementary Data 15.
Calling of single-nucleotide variants and small insertions and deletions Somatic. Somatic SNVs were detected from matched tumor/normal sample pairs by an in-house analysis pipeline based on SAMtools mpileup and bcftools and using heuristic filtering as previously described [49][50][51] . In short, initial SNV calls were detected in the tumor BAM by SAMtools (version 0.1.19) mpileup, which considered only reads with minimum mapping quality of 30 (-q 30), and BCFtools, which reported all positions containing at least one high-quality nonreference base (-vcgN -p 2.0). Afterwards these positions were checked in the control sample using mpileup. SNVs were then annotated with ANNOVAR (version November 2014) using GENCODE (release 19). Downstream filtering discarded variants with low support of the alternative allele, occurring in tandem repeats and other read-attracting regions, having PCR strand bias (WGS only), having sequencing strand bias, and having significant bias in the PV4 field of the mpileup output. SNVs with low confidence score were discarded. Somatic SNVs annotated as missense, stopgain, stoploss, or splicing (two base pairs next to an exon boundary) were defined as non-silent. Short indels were detected by Platypus (version 0.8.1) for matched tumor/normal sample pairs 52 . Only ones that had Platypus filter flag PASS or passed custom filters allowing for low variant frequency were retained. Annotation of short indels was done using ANNOVAR (version February 2016). The calls falling into a coding sequence or splice-site were extracted.
Germline. Germline indels were called by Platypus. SNVs identified in the tumor sample were annotated as germline if the control sample had at least 1/30 reads supporting the alternative allele. Germline variants in 101 cancer predisposition genes (Supplementary Data 4) were further filtered for rare variants and against frequent variants in an in-house database before assessment according to AMP-ACMG guidelines. The p-value for age at onset comparison was generated with a two-sided, equal variance t-test.
Tumor ploidy, purity and copy number profile determination. For samples sequenced with WGS, the absolute allele-specific copy numbers, tumor ploidy and purity were estimated using ACEseq (version 5.0.1) 53 .
For samples sequenced with WES, the absolute allele-specific copy numbers were estimated using CNVkit (version 0.9.3) 54 . Segments containing at least 20 heterozygous SNPs were further processed to infer sample ploidy and tumor cell content (TCC) using a method adapted from ACEseq. The algorithm tested each possible combination of TCC (range 0.15-1.0) and ploidy (range 1.0-6.5) to find the local minima and thus optimal solution. If more than one solution was possible, they were visually evaluated and ultimately one of them was chosen. Samples with tumor cell content estimated to be 100% were considered unreliable (due to in fact low tumor cell content) and thus discarded from the results (n = 19).
Microsatellite instability. Microsatellite instability was detected with MSIsensor (version 0.2) 55 . The list of homopolymers and microsatellites generated with the MSIsensor scan command from the 1000 genomes reference comprises 33,386,244 loci. MSIsensor was run with a minimum required coverage of 15 reads for genomes and 30 for exomes in both tumor and control. A score > 3.5 implies microsatellite instability.
RNA sequencing and gene fusion detection. If RNA quality was sufficient, either the Illumina TruSeq RNA (with 1000 ng total RNA) or the Illumina TruSeq mRNA stranded protocol (with 500 ng total RNA) was used for library preparation (TruSeq mRNA stranded since February 2016, Supplementary Data 15). Both are Oligo-dT-based protocols and enrich for mRNA only. Three libraries were pooled and sequenced on one lane HiSeq 4000 100 PE. The reads were aligned to the same reference genome as DNA sequencing data with STAR 2.5.1b 56 . The gene fusions were detected by Arriba pipeline (version 0.8), the software is available on GitHub 57 . Fusions were categorized into high, medium or low level of confidence.
Mutational signatures. Mutational signatures were calculated using R/Bioconductor package YAPSA (version 1.13.3) 58 and COSMIC signatures (version 2) 59 . All identified somatic SNVs were used for the analysis. Six samples with less than 50 SNVs were excluded from the analysis. Mutational catalogs were calculated separately for the whole-exome and the whole-genome sequencing data. The whole-exome catalog was corrected additionally by factors specific for the target capture kits that were used for the preparation of samples. Afterwards, mutational catalogs were normalized by the average length of the coding sequence in Mb (2800 and 30 for WGS and WES, respectively) and merged together for signature decomposition. Exposures were calculated per sample using the set of 30 validated signatures (no artifact signatures) and absolute signature-specific cutoffs with cost factor 6. Corresponding confidence intervals were calculated per sample. Only if their lower bound was greater than 0, the signature was considered to be positively identified.
Methylation-based entity prediction. The Infinium MethylationEPIC BeadChip microarray (850 K) was used for 55 CUP samples and 77 samples of the validation cohort (Supplementary Data 7) to interrogate DNA methylation patterns at genome-wide level. All samples were gathered within the NCT/DKTK MASTER program and had a tumor cell content >30% of the NCT/DKTK MASTER cohort to interrogate DNA methylation patterns at the genome-wide level. The library preparation, hybridization and scanning of the array was performed at the German Cancer Research Center (DKFZ) Genomics & Proteomics Core Facility. The raw data (idat files) were processed into beta values with the minfi R package 60 . Beta values range from 0 to 1 with 0 being a CpG unmethylated and 1 fully methylated.
TCGA (The Cancer Genome Atlas) pan-cancer methylation was retrieved via the curatedTCGAData package 61 . The dataset (33 entities, 8024 samples) comprised both 450k and 27k methylation arrays. The intersection of these arrays comprised 25978 CpGs. For a more meaningful entity prediction colorectal (COAD) and rectal (READ) adenocarcinomas were binned together (COAD/ READ).
The 5000 CpGs for the methylation-based entity predictions were derived after (i) removing known SNPs as previously described 62 , (ii) only considering overlapping CpGs between 850k, 450k and 27k arrays to ensure compatibility with all Illumina methylation arrays and (iii) lastly calculating the top 5000 most variant CpGs across the pan-cancer dataset. The probe IDs are listed in Supplementary Data 16.
Methylation-based entity prediction of 55 CUP samples and 77 samples of the validation cohort was performed by correlating (Spearman correlation) the vector of 5000 CpGs with all samples in the TCGA cohort. The entity of the sample with the highest correlation coefficient was deemed to be the predicted entity.
Similarity in methylation profiles was visualized by tSNE plot with the Rtsne R package 63 . Missing data was imputed with the impute.knn function. The perplexity was set to 100.
Transcriptome-based entity prediction. In order to identify the tissue of origin of a CUP sample based on gene expression, we searched for samples with a similar expression profile in two reference cohorts: the MASTER cohort (comprising 1890 samples from 1814 patients, Supplementary Data 9) and the union of 33 TCGA cohorts (TCGA cohorts with >50 samples were subsampled, yielding a total of 1809 samples). For each reference cohort, we compared the expression profile of the CUP sample to all possible pairwise combinations of reference samples. The reference samples were ranked by the number of times they were more similar to the CUP sample than the other reference sample in a given pair of reference samples. Similarity was measured as the fraction of genes that were upregulated in both the CUP sample and one of the samples in a given pair of reference samples (FPKM > 13), but downregulated in the other reference sample (FPKM < 3). The thresholds for up-and downregulation were determined by means of 10-fold cross-validation on a subset of the MASTER cohort. To mitigate the distortion of the CUP expression profile by contamination from surrounding normal tissue in the bulk RNA-Seq data, we ignored genes found to be upregulated in normal liver tissue (Supplementary Data 17) if the sample was obtained by liver biopsy. The entity of the most similar reference sample was assumed to predict the entity of the CUP sample. If the most similar reference sample was a CUP as well, the most similar non-CUP sample was chosen for prediction instead. The method was validated on 72 patients from the validation cohort (Supplementary Data 6).
Tumor mutational burden (mutations per megabase). For each sample, the numbers of non-silent SNVs and coding indels in the exons of the tumor were added and divided by the length of the coding sequence of the genome (in Mb). The denominator depended on the technology, including different library preparation kits, used for sequencing of a sample. For samples sequenced with WGS, the GENCODE Human v19 gene annotation (GTF format) was taken, coding sequences were identified and merged, and the total length was calculated. For samples sequenced with WES, however, the merged coding sequences were additionally intersected with the coordinates of the corresponding target capture (BED format). All sequence operations were done using bedtools v2.27.1 64 . Calculations resulted in lengths: (i) 35.334619 Mb for WGS, (ii) 31.057260 Mb for WES with SureSelectXT Human All Exon V5 including UTRs and (iii) 30.894643 Mb for WES with SureSelectXT Human All Exon V5 excluding UTRs. Please note that the MTB used the sum of non-silent SNVs and coding indels as measure for TMB.
Homologous recombination deficiency. Homologous recombination deficiency (HRD) was determined using three different methods. The first one was being used for Molecular Tumor Board and could be applied to both whole-exome and whole-genome sequencing data. This method was based solely on results from the copy number analysis, and consisted of the estimation of two parameters: loss of heterozygosity (LOH-HRD) 65 and large-scale state transitions (LST) 66 . An unweighted sum of those produced a score, which classified samples to high (>20), intermediate (11)(12)(13)(14)(15)(16)(17)(18)(19)(20) or low (≤10) level of impaired homologous recombination.
The other two methods which were used, HRDetect 67 and CHORD (version 2.0) 68 , calculate a method-specific probability score of HR deficiency and can be applied to whole-genome sequencing data only. As inputs, they both used raw data comprising single-nucleotide variants, small insertions and deletions, structural variants (detected with SOPHIA, https://bitbucket.org/utoprak/sophia/src/master) and, only in case of HRDetect, copy number variation. All 27 whole-genome sequencing samples with reliable copy number data were therefore used in the analysis. Genes considered as HRD related in Fig. 1 are listed in Supplementary Data 18.
Viral infections. We used three computational approaches to detect viral infections from next-generation sequencing (NGS) data: a k-mer-based approach (Kraken2 version 2.1.2 69 ), an assembly-based approach (P-DiP 70 ), and an alignment-based approach where the sequencing reads were aligned against concatenated assemblies of the human genome and all RefSeq viral genomes, in accordance with Arriba's workflow for the detection of viruses. Kraken2 was considered to make a call if it detected at least one read per 40 million mapped reads as originating from a virus and if at least 10% of the viral genome was covered with reads. For P-DiP, a cutoff of one virus-originating read per million mapped reads was used. Moreover for the Arriba workflow, a sample was considered to be associated with a virus when at least 5% and 100 bp (whichever was bigger) of the viral genome was covered with reads. Supplementary Table 1 lists all viruses that were reported by at least two methods. To detect viral integration sites, we used Arriba version 2.1.0 for RNA-Seq data and VIRUSBreakend version 2.12.0 71 for DNA-Seq data.
Additional data processing and analysis. The downstream analysis was performed in R (version 3.4.3) using Bioconductor repository and such packages as tidyverse (version 1.2.1) 72 , ComplexHeatmap (version 1.99.5) 73 and Biobase (version 2.38.0) 74 . If possible, the sample used for the first MTB was used for the general cohort description, only for CUP-70 we analyzed the sample for the second MTB. PFS, PFSr, and mPFSr were calculated using Microsoft Excel. Survival analysis using Kaplan-Meier estimator and log-rank tests was performed using ggplot2 (version 3.3.3). p-values < 0.05 were considered statistically significant.
Reporting summary. Further information on research design is available in the Nature Research Reporting Summary linked to this article.

Data availability
TCGA pan-cancer methylation was retrieved via the curatedTCGAData package 61 . FPKM expression values of the TCGA cohorts were obtained from the GDC data release v22.0. Genome, transcriptome and methylation data generated in this study have been deposited in the European Genome-phenome Archive under the accession number EGAS00001004786. The data are available under controlled access due to the sensitive nature of genome sequencing data, and access can be obtained by contacting the appropriate Data Access Committee listed for each dataset in the study. Access will be granted to commercial and non-commercial parties according to patient consent forms and data transfer agreements for as long as needed. We have an institutional process in place to deal with requests for data transfer and aim for rapid response time. GENCODE (release 19) was used for gene annotation and is publicly available. The raw clinical data are protected and are not available due to data privacy laws. The processed clinical data are available as Supplementary Data files. The remaining data are available within the Article, Supplementary Information, Supplementary Data or Source Data file. Source data are provided with this paper.

Code availability
Bioinformatics analyses were performed using above-mentioned open-source software with parameters as described in each method section. The R script for transcriptomebased entity prediction is available in Supplementary Software 1.